Chi-square goodness-of-fit test (χ² GOF)#

The chi-square goodness-of-fit test answers:

“Do these observed category counts look like they came from a specified categorical distribution?”

It’s the right tool when you have:

  • one categorical variable (k categories)

  • counts per category (not measurements)

  • expected probabilities for each category under the null hypothesis


Learning goals#

By the end you should be able to:

  • set up \(H_0/H_1\) for χ² GOF

  • compute the χ² statistic, degrees of freedom, and a p-value

  • interpret the result correctly (what “reject” / “fail to reject” mean)

  • diagnose which categories drive the result (residuals + contributions)

  • implement the test from scratch with NumPy

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

from math import erf, sqrt

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)

1) What it’s used for (and what it’s not)#

Used for#

You use χ² GOF when you want to compare observed counts to an expected distribution.

Examples:

  • Is a die fair (uniform probabilities across 6 faces)?

  • Does a website’s traffic split match a target mix (e.g., 50% mobile, 30% desktop, 20% tablet)?

  • Do survey responses match a claimed distribution?

Not the same as…#

  • χ² test of independence: compares two categorical variables via a contingency table.

  • KS / Anderson–Darling: goodness-of-fit for continuous distributions (without binning).

Conceptually, χ² GOF is a large-sample approximation to an exact test of the multinomial model under \(H_0\).
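
For concreteness, here is what the three ingredients look like for the traffic-split example above (the labels, counts, and target mix are hypothetical, chosen just for illustration):

# Hypothetical traffic-split example: the three ingredients of a GOF test.
categories = ["mobile", "desktop", "tablet"]
observed = np.array([520, 270, 210])       # counts per category (not measurements)
target_mix = np.array([0.50, 0.30, 0.20])  # H0 probabilities (sum to 1)

observed.sum() * target_mix                # expected counts under H0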

2) Hypotheses#

Suppose there are \(k\) categories with expected probabilities:

\[ \mathbf{p}_0 = (p_{01}, \dots, p_{0k}),\quad \sum_{i=1}^k p_{0i}=1 \]

With observed counts \(\mathbf{O}=(O_1,\dots,O_k)\) and total \(n=\sum_i O_i\):

  • Null \(H_0\): the data were generated with probabilities \(\mathbf{p}_0\)

  • Alternative \(H_1\): at least one category probability differs from \(\mathbf{p}_0\)

This is a global test: it detects that something differs, then you inspect residuals to see where.

3) The test statistic (Pearson’s χ²)#

Under \(H_0\), the expected count in category \(i\) is:

\[ E_i = n\,p_{0i} \]

Pearson’s chi-square statistic is:

\[ \chi^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i} \]

Why this form?#

  • \(O_i-E_i\) is the raw discrepancy.

  • Dividing by \(E_i\) scales it by the typical size of fluctuations (variance grows with \(E_i\)).

  • Squaring stops positive and negative deviations from canceling, so every discrepancy adds up.

A helpful diagnostic is the standardized residual:

\[ R_i = \frac{O_i - E_i}{\sqrt{E_i}} \quad\Rightarrow\quad \chi^2 = \sum_i R_i^2 \]

So χ² is literally the sum of squared standardized residuals.

# Warm-up: see the pieces on a tiny example
observed_counts = np.array([18, 22, 20])
expected_probs = np.array([1 / 3, 1 / 3, 1 / 3])

n = observed_counts.sum()
expected_counts = n * expected_probs

differences = observed_counts - expected_counts
standardized_residuals = differences / np.sqrt(expected_counts)
contributions = differences**2 / expected_counts

np.column_stack(
    [observed_counts, expected_counts, differences, standardized_residuals, contributions]
)
array([[18.    , 20.    , -2.    , -0.4472,  0.2   ],
       [22.    , 20.    ,  2.    ,  0.4472,  0.2   ],
       [20.    , 20.    ,  0.    ,  0.    ,  0.    ]])

The columns above are:

  1. observed \(O_i\)

  2. expected \(E_i\)

  3. difference \(O_i-E_i\)

  4. standardized residual \(R_i=(O_i-E_i)/\sqrt{E_i}\)

  5. contribution \((O_i-E_i)^2/E_i\) (these sum to χ²)

4) Degrees of freedom#

If all \(k\) probabilities are fully specified under \(H_0\) (no parameters estimated from the data), then:

\[ \chi^2 \overset{H_0}{\approx} \chi^2_{\nu},\quad \nu = k-1 \]

Why \(k-1\) and not \(k\)?

  • The counts must sum to \(n\) (equivalently, the probabilities sum to 1), so only \(k-1\) of the counts can vary freely.

If you estimated \(m\) parameters from the data (e.g., you fit a Poisson rate \(\lambda\) before comparing binned counts), you reduce df:

\[ \nu = (k-1) - m \]

This matters: estimating parameters makes the null more flexible (it adapts to the data), so the reference distribution has fewer degrees of freedom.
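
For instance, here is a minimal sketch of that Poisson case (synthetic data; \(\lambda\) estimated from the sample, so \(m=1\)):

from math import exp, factorial

rng_pois = np.random.default_rng(7)
data = rng_pois.poisson(2.0, size=300)

# Bin into categories 0, 1, 2, 3, 4, and "5+"
observed = np.array([np.sum(data == j) for j in range(5)] + [int(np.sum(data >= 5))])

lam = data.mean()  # estimated from the data -> m = 1
pmf = np.array([exp(-lam) * lam**j / factorial(j) for j in range(5)])
probs = np.append(pmf, 1.0 - pmf.sum())  # lump the upper tail into "5+"

expected = observed.sum() * probs
chi2_stat = float(((observed - expected) ** 2 / expected).sum())
df = (observed.size - 1) - 1  # (k - 1) - m = (6 - 1) - 1 = 4
chi2_stat, df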

5) Assumptions / when the approximation is ok#

χ² GOF relies on an asymptotic (large-sample) approximation.

Common practical checks:

  • Observations are independent.

  • Categories are mutually exclusive and collectively exhaustive.

  • Expected counts are not too small.

    • Rule of thumb: all \(E_i \ge 5\) (or at least 80% ≥ 5 and none < 1).

If expected counts are small:

  • combine rare categories, or

  • use a Monte Carlo / exact multinomial approach (see the sketch below).
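
A minimal Monte Carlo sketch of that idea (the helper name and the add-one tail correction are choices made here, not a standard API): simulate counts from the null multinomial and use the simulated χ² values as the reference distribution.

def multinomial_gof_pvalue(observed, probs, n_sim=100_000, rng=None):
    '''Monte Carlo GOF p-value: simulate counts under H0, compare χ² values.'''
    rng = np.random.default_rng() if rng is None else rng
    observed = np.asarray(observed, dtype=float)
    probs = np.asarray(probs, dtype=float)
    n = int(observed.sum())
    expected = n * probs
    chi2_obs = ((observed - expected) ** 2 / expected).sum()
    sims = rng.multinomial(n, probs, size=n_sim)
    chi2_sim = ((sims - expected) ** 2 / expected).sum(axis=1)
    # Add-one correction keeps the estimated p-value away from exactly 0.
    return (1 + np.count_nonzero(chi2_sim >= chi2_obs)) / (n_sim + 1)


# Tiny sample whose expected counts violate the E_i >= 5 rule of thumb:
multinomial_gof_pvalue([7, 2, 1], [0.5, 0.3, 0.2], rng=np.random.default_rng(0))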

Also remember:

  • The test is sensitive to sample size: with large \(n\), tiny deviations can become statistically significant.

6) From-scratch implementation (NumPy-only core)#

We’ll implement:

  • χ² statistic and per-category contributions

  • degrees of freedom

  • p-value via Monte Carlo sampling from a χ² distribution (using only NumPy)

  • a fast p-value approximation (Wilson–Hilferty)

  • a simple effect size: Cramér’s V for GOF

We’ll then compare with SciPy as a practical reference.

def _as_1d_nonnegative(name, array_like):
    values = np.asarray(array_like, dtype=float)
    if values.ndim != 1:
        raise ValueError(f"{name} must be a 1D array-like.")
    if not np.all(np.isfinite(values)):
        raise ValueError(f"{name} must contain only finite values.")
    if np.any(values < 0):
        raise ValueError(f"{name} must be non-negative.")
    return values


def chi_square_gof_statistic(observed_counts, *, expected_probs=None, expected_counts=None):
    '''Compute Pearson's χ² GOF statistic and diagnostics.

    Provide either expected_probs (summing to 1) or expected_counts.
    '''

    observed_counts = _as_1d_nonnegative("observed_counts", observed_counts)
    total_count = float(observed_counts.sum())
    if total_count <= 0:
        raise ValueError("observed_counts must sum to > 0.")

    if expected_counts is None:
        if expected_probs is None:
            raise ValueError("Provide expected_probs or expected_counts.")

        expected_probs = _as_1d_nonnegative("expected_probs", expected_probs)
        if expected_probs.shape != observed_counts.shape:
            raise ValueError("expected_probs must have the same shape as observed_counts.")

        prob_sum = float(expected_probs.sum())
        if not np.isclose(prob_sum, 1.0):
            raise ValueError(f"expected_probs must sum to 1. Got {prob_sum}.")

        expected_counts = total_count * expected_probs
    else:
        expected_counts = _as_1d_nonnegative("expected_counts", expected_counts)
        if expected_counts.shape != observed_counts.shape:
            raise ValueError("expected_counts must have the same shape as observed_counts.")

        expected_sum = float(expected_counts.sum())
        if expected_sum <= 0:
            raise ValueError("expected_counts must sum to > 0.")

        # Rescale to match the observed total (helps with rounding or probabilities * n).
        expected_counts = expected_counts * (total_count / expected_sum)

    if np.any(expected_counts <= 0):
        raise ValueError("All expected counts must be > 0 (no zero-probability categories).")

    differences = observed_counts - expected_counts
    contributions = differences**2 / expected_counts
    chi2_stat = float(contributions.sum())

    standardized_residuals = differences / np.sqrt(expected_counts)

    return {
        "chi2": chi2_stat,
        "observed": observed_counts,
        "expected": expected_counts,
        "differences": differences,
        "contributions": contributions,
        "standardized_residuals": standardized_residuals,
        "n": total_count,
        "k": int(observed_counts.size),
    }


def chi_square_gof_df(k, ddof=0):
    df = int(k - 1 - ddof)
    if df <= 0:
        raise ValueError(
            f"Degrees of freedom must be positive. Got df={df} (k={k}, ddof={ddof})."
        )
    return df


def chi_square_sf_mc(x, df, *, n_sim=200_000, batch_size=50_000, rng=None):
    '''Monte Carlo estimate of P(ChiSq_df >= x).

    Uses ChiSq_df = sum_{j=1..df} Z_j^2 where Z_j ~ N(0,1).
    '''

    rng = np.random.default_rng() if rng is None else rng

    x = float(x)
    if x < 0:
        return 1.0, 0.0

    n_sim = int(n_sim)
    batch_size = int(batch_size)
    if n_sim <= 0 or batch_size <= 0:
        raise ValueError("n_sim and batch_size must be positive integers.")

    count_ge = 0
    remaining = n_sim

    while remaining > 0:
        current = min(batch_size, remaining)
        sims = np.square(rng.standard_normal(size=(current, df))).sum(axis=1)
        count_ge += int(np.count_nonzero(sims >= x))
        remaining -= current

    p_value = count_ge / n_sim
    se = sqrt(p_value * (1 - p_value) / n_sim)
    return p_value, se


def chi_square_sf_wilson_hilferty(x, df):
    '''Fast upper-tail approximation using the Wilson–Hilferty transform.'''

    x = float(x)
    df = float(df)

    if x <= 0:
        return 1.0

    z = ((x / df) ** (1 / 3) - (1 - 2 / (9 * df))) / sqrt(2 / (9 * df))
    phi = 0.5 * (1.0 + erf(z / sqrt(2.0)))
    return 1.0 - phi


def cramers_v_gof(chi2_stat, n, k):
    return sqrt(float(chi2_stat) / (float(n) * (k - 1)))


def chi_square_gof_test(
    observed_counts,
    *,
    expected_probs=None,
    expected_counts=None,
    ddof=0,
    alpha=0.05,
    n_sim=200_000,
    rng=None,
):
    details = chi_square_gof_statistic(
        observed_counts, expected_probs=expected_probs, expected_counts=expected_counts
    )

    df = chi_square_gof_df(details["k"], ddof=ddof)

    p_mc, p_mc_se = chi_square_sf_mc(details["chi2"], df, n_sim=n_sim, rng=rng)
    p_wh = chi_square_sf_wilson_hilferty(details["chi2"], df)

    return {
        **details,
        "df": df,
        "alpha": float(alpha),
        "p_value_mc": float(p_mc),
        "p_value_mc_se": float(p_mc_se),
        "p_value_wh": float(p_wh),
        "reject_h0_mc": bool(p_mc < alpha),
        "cramers_v": float(cramers_v_gof(details["chi2"], details["n"], details["k"])),
    }
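
As a quick sanity check (assuming SciPy is available), we can run the pipeline on the warm-up counts and compare both p-value estimates against SciPy's χ² survival function:

from scipy.stats import chi2 as chi2_dist

check = chi_square_gof_test(
    np.array([18, 22, 20]),
    expected_probs=np.ones(3) / 3,
    rng=np.random.default_rng(42),
)

{
    "chi2": check["chi2"],
    "p_mc": check["p_value_mc"],
    "p_wh": check["p_value_wh"],
    "p_scipy": float(chi2_dist.sf(check["chi2"], check["df"])),
}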

7) Example: is a die fair?#

We roll a 6-sided die \(n\) times.

  • \(H_0\): each face has probability \(1/6\)

  • \(H_1\): at least one face differs

We’ll look at two datasets:

  1. a sample from a fair die (should usually not reject)

  2. a sample from a biased die (should reject)

faces = np.arange(1, 7)
uniform_probs = np.ones(6) / 6

n_rolls = 120

rng_example = np.random.default_rng(1)
observed_fair = rng_example.multinomial(n_rolls, uniform_probs)
observed_biased = rng_example.multinomial(n_rolls, np.array([0.10, 0.10, 0.10, 0.10, 0.10, 0.50]))

rng_mc_fair = np.random.default_rng(100)
rng_mc_biased = np.random.default_rng(101)

fair_result = chi_square_gof_test(observed_fair, expected_probs=uniform_probs, n_sim=200_000, rng=rng_mc_fair)
biased_result = chi_square_gof_test(observed_biased, expected_probs=uniform_probs, n_sim=200_000, rng=rng_mc_biased)

{
    "observed_fair": observed_fair,
    "chi2_fair": fair_result["chi2"],
    "p_mc_fair": fair_result["p_value_mc"],
    "reject_fair@0.05": fair_result["reject_h0_mc"],
    "observed_biased": observed_biased,
    "chi2_biased": biased_result["chi2"],
    "p_mc_biased": biased_result["p_value_mc"],
    "reject_biased@0.05": biased_result["reject_h0_mc"],
}
{'observed_fair': array([20, 27, 14, 26, 15, 18]),
 'chi2_fair': 7.5,
 'p_mc_fair': 0.18619,
 'reject_fair@0.05': False,
 'observed_biased': array([11, 15, 11, 12,  6, 65]),
 'chi2_biased': 123.6,
 'p_mc_biased': 0.0,
 'reject_biased@0.05': True}
from plotly.subplots import make_subplots

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("Sample from a fair die", "Sample from a biased die"),
)

for col, observed in [(1, observed_fair), (2, observed_biased)]:
    expected = observed.sum() * uniform_probs
    fig.add_trace(go.Bar(name="Observed", x=faces, y=observed), row=1, col=col)
    fig.add_trace(
        go.Bar(name="Expected (H0)", x=faces, y=expected, marker_color="rgba(0,0,0,0.25)"),
        row=1,
        col=col,
    )

fig.update_layout(
    title="Observed vs expected counts (χ² GOF setup)",
    barmode="group",
    legend_title_text="Counts",
)
fig.update_xaxes(title_text="Face", row=1, col=1)
fig.update_xaxes(title_text="Face", row=1, col=2)
fig.update_yaxes(title_text="Count", row=1, col=1)
fig.update_yaxes(title_text="Count", row=1, col=2)

fig.show()

Which categories drive the result?#

Two useful per-category diagnostics:

  • Standardized residuals: \(R_i=(O_i-E_i)/\sqrt{E_i}\) (signed)

  • Contributions: \((O_i-E_i)^2/E_i\) (non-negative)

Large absolute residuals / contributions point to categories that disagree most with \(H_0\).

fig = make_subplots(
    rows=2,
    cols=1,
    shared_xaxes=True,
    vertical_spacing=0.1,
    subplot_titles=("Standardized residuals (signed)", "χ² contributions (sum = χ²)"),
)

fig.add_trace(
    go.Bar(x=faces, y=biased_result["standardized_residuals"], name="Residual"),
    row=1,
    col=1,
)
fig.add_hline(y=0, line_width=1, line_color="black", row=1, col=1)

fig.add_trace(
    go.Bar(x=faces, y=biased_result["contributions"], name="Contribution", marker_color="#636EFA"),
    row=2,
    col=1,
)

fig.update_layout(title="Diagnostics for the biased sample")
fig.update_xaxes(title_text="Face", row=2, col=1)
fig.update_yaxes(title_text="Residual", row=1, col=1)
fig.update_yaxes(title_text="Contribution", row=2, col=1)

fig.show()

Visualizing the p-value (null distribution)#

Under \(H_0\), the χ² statistic follows a χ² distribution (approximately).

A p-value is:

\[ \text{p-value} = \Pr(\chi^2_{\nu} \ge \chi^2_{\text{obs}} \mid H_0) \]

So it’s a tail probability under the null model.

df = biased_result["df"]
chi2_obs = biased_result["chi2"]
alpha = biased_result["alpha"]

rng_null = np.random.default_rng(123)
null_samples = np.square(rng_null.standard_normal(size=(60_000, df))).sum(axis=1)

critical_value = float(np.quantile(null_samples, 1 - alpha))

fig = go.Figure()
fig.add_histogram(
    x=null_samples[null_samples < chi2_obs],
    nbinsx=60,
    name="Null samples (< observed)",
    marker_color="lightgray",
    opacity=0.8,
)
fig.add_histogram(
    x=null_samples[null_samples >= chi2_obs],
    nbinsx=60,
    name="Tail samples (≥ observed)",
    marker_color="crimson",
    opacity=0.9,
)

fig.add_vline(
    x=chi2_obs,
    line_width=3,
    line_color="crimson",
    annotation_text=f"Observed χ² = {chi2_obs:.2f}",
    annotation_position="top right",
)
fig.add_vline(
    x=critical_value,
    line_width=2,
    line_dash="dash",
    line_color="black",
    annotation_text=f"Critical (α={alpha:.2f}) ≈ {critical_value:.2f}",
    annotation_position="top left",
)

fig.update_layout(
    title=f"Null distribution for χ²(df={df}) with observed statistic (biased sample)",
    barmode="overlay",
    xaxis_title="χ² statistic",
    yaxis_title="Frequency",
)

fig.show()
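
To tie this back to the multinomial model from section 1, here is a quick sketch (reusing n_rolls, uniform_probs, and null_samples from the cells above): simulate counts from the null multinomial itself, compute χ² for each draw, and compare quantiles against the χ²(df) reference. The two sets of quantiles should agree closely.

# χ² statistics of multinomial counts under H0 vs. the χ²(df) reference.
rng_mult = np.random.default_rng(7)
sim_counts = rng_mult.multinomial(n_rolls, uniform_probs, size=60_000)
expected_uniform = n_rolls * uniform_probs
chi2_from_multinomial = ((sim_counts - expected_uniform) ** 2 / expected_uniform).sum(axis=1)

qs = [0.5, 0.9, 0.95, 0.99]
np.column_stack([
    np.quantile(chi2_from_multinomial, qs),
    np.quantile(null_samples, qs),
])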

{
    "df": df,
    "chi2_obs": chi2_obs,
    "p_value_mc": biased_result["p_value_mc"],
    "p_value_mc_se": biased_result["p_value_mc_se"],
    "p_value_wh_approx": biased_result["p_value_wh"],
}
{'df': 5,
 'chi2_obs': 123.6,
 'p_value_mc': 0.0,
 'p_value_mc_se': 0.0,
 'p_value_wh_approx': 0.0}

8) How to interpret the result#

Pick a significance level α (often 0.05).

  • If p-value < α: reject \(H_0\)

    • The observed counts would be unlikely if the expected distribution were true.

    • Interpretable as evidence of a mismatch.

  • If p-value ≥ α: fail to reject \(H_0\)

    • The data are reasonably consistent with the expected distribution.

    • This is not proof that \(H_0\) is true (you may have low power).

What it does not mean#

  • The p-value is not \(\Pr(H_0 \mid \text{data})\).

  • A non-significant p-value does not mean “the distributions are identical”.

Effect size (optional but useful)#

For goodness-of-fit, a common effect size is Cramér’s V:

\[ V = \sqrt{\frac{\chi^2}{n(k-1)}} \]

  • \(V \approx 0\) means very close to the expected distribution.

  • For fixed proportional deviations, χ² grows with \(n\), but \(V\) stays roughly stable (see the quick check below).
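
Here is a quick check of that stability claim, using the cramers_v_gof helper from section 6 on idealized, noise-free counts:

# Same proportional deviation (±0.05 on two of four categories) at two n.
p0_demo = np.ones(4) / 4
p_shift = p0_demo + np.array([0.05, -0.05, 0.0, 0.0])

for n_demo in (200, 20_000):
    expected_demo = n_demo * p0_demo
    chi2_demo = float(((n_demo * p_shift - expected_demo) ** 2 / expected_demo).sum())
    v_demo = cramers_v_gof(chi2_demo, n_demo, k=4)
    print(f"n={n_demo:>6}  chi2={chi2_demo:8.2f}  V={v_demo:.4f}")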

9) Example: known non-uniform expected mix (candy colors)#

Suppose a brand claims the following color mix:

  • Blue 24%, Orange 20%, Green 16%, Yellow 14%, Red 13%, Brown 13%

We sample a bag and count colors. Are the counts consistent with the claim?

colors = np.array(["Blue", "Orange", "Green", "Yellow", "Red", "Brown"])
expected_probs_claim = np.array([0.24, 0.20, 0.16, 0.14, 0.13, 0.13])

# A sample that is slightly different from the claim
observed_colors = np.array([40, 50, 30, 28, 28, 24])

rng_mc_colors = np.random.default_rng(202)
color_result = chi_square_gof_test(
    observed_colors,
    expected_probs=expected_probs_claim,
    n_sim=250_000,
    rng=rng_mc_colors,
)

{
    "observed": dict(zip(colors, observed_colors.astype(int))),
    "chi2": color_result["chi2"],
    "df": color_result["df"],
    "p_value_mc": color_result["p_value_mc"],
    "reject@0.05": color_result["reject_h0_mc"],
    "cramers_v": color_result["cramers_v"],
}
{'observed': {'Blue': 40,
  'Orange': 50,
  'Green': 30,
  'Yellow': 28,
  'Red': 28,
  'Brown': 24},
 'chi2': 4.266025641025641,
 'df': 5,
 'p_value_mc': 0.511144,
 'reject@0.05': False,
 'cramers_v': 0.06531481945948898}
expected_colors = color_result["expected"]

fig = go.Figure()
fig.add_bar(name="Observed", x=colors, y=observed_colors)
fig.add_bar(name="Expected (claim)", x=colors, y=expected_colors, marker_color="rgba(0,0,0,0.25)")

fig.update_layout(
    title="Candy colors: observed vs expected",
    barmode="group",
    xaxis_title="Color",
    yaxis_title="Count",
)

fig.show()

10) Sample size sensitivity (why χ² can flag tiny deviations)#

If the true proportions differ from \(\mathbf{p}_0\) by some small amount, the χ² statistic tends to grow with \(n\).

We’ll keep the proportional deviation fixed and increase \(n\).

def counts_from_probs(n, probs):
    probs = np.asarray(probs, dtype=float)
    if not np.isclose(probs.sum(), 1.0):
        raise ValueError("probs must sum to 1")

    raw = n * probs
    counts = np.floor(raw).astype(int)
    remainder = int(n - counts.sum())
    if remainder > 0:
        frac = raw - np.floor(raw)
        add_to = np.argsort(frac)[::-1][:remainder]
        counts[add_to] += 1
    return counts


# Base expectation: uniform over 6 categories
p0 = np.ones(6) / 6

# A small deviation that still sums to 0
p_true = p0 + np.array([0.02, -0.01, 0.01, -0.02, 0.00, 0.00])

sample_sizes = np.array([60, 120, 240, 500, 1000, 2000, 5000])
chi2_values = []
p_values = []

from scipy.stats import chi2

for n in sample_sizes:
    observed = counts_from_probs(int(n), p_true)
    result = chi_square_gof_statistic(observed, expected_probs=p0)
    df = chi_square_gof_df(result["k"])

    chi2_values.append(result["chi2"])
    p_values.append(float(chi2.sf(result["chi2"], df)))

fig = make_subplots(rows=1, cols=2, subplot_titles=("χ² grows with n", "p-value shrinks with n"))

fig.add_trace(go.Scatter(x=sample_sizes, y=chi2_values, mode="lines+markers", name="χ²"), row=1, col=1)
fig.add_trace(
    go.Scatter(x=sample_sizes, y=p_values, mode="lines+markers", name="p-value", marker_color="crimson"),
    row=1,
    col=2,
)

fig.update_xaxes(title_text="n", row=1, col=1)
fig.update_xaxes(title_text="n", row=1, col=2)
fig.update_yaxes(title_text="χ²", row=1, col=1)
fig.update_yaxes(title_text="p-value", type="log", row=1, col=2)

fig.update_layout(title="Same proportional deviation, different sample sizes")
fig.show()

11) Practical check with SciPy#

SciPy has a built-in implementation that matches the formula:

  • scipy.stats.chisquare(f_obs, f_exp, ddof=...)

It returns the χ² statistic and a p-value computed from the asymptotic χ² distribution (no Monte Carlo error, but still a large-sample approximation).

from scipy.stats import chisquare

chi2_scipy, p_scipy = chisquare(observed_biased, f_exp=(observed_biased.sum() * uniform_probs))

{
    "numpy_stat": biased_result["chi2"],
    "scipy_stat": float(chi2_scipy),
    "scipy_p_value": float(p_scipy),
}
{'numpy_stat': 123.6,
 'scipy_stat': 123.6,
 'scipy_p_value': 5.419327083852984e-25}

12) Checklist + common pitfalls#

  • Use counts, not percentages (convert percentages to counts by multiplying by \(n\)).

  • Make sure expected probabilities sum to 1 and all expected counts are > 0.

  • Watch out for small expected counts (combine categories or use Monte Carlo / exact).

  • If you estimated parameters to build \(\mathbf{p}_0\) from the data, adjust df: \(\nu=(k-1)-m\).

  • A significant result doesn’t tell you which categories differ—use residuals / contributions.


Exercises#

  1. Simulate many fair-die experiments and estimate the false-positive rate at α=0.05 (a starter sketch follows the exercises).

  2. Create a biased die with a subtle bias (e.g., 0.18 on one face) and see how large \(n\) must be to detect it reliably.

  3. Try binning continuous data into categories and apply χ² GOF, adjusting df if you estimated parameters.
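
A starter sketch for Exercise 1 (the sample size, seed, and number of experiments are arbitrary choices):

from scipy.stats import chi2 as chi2_dist

rng_ex = np.random.default_rng(0)
n_experiments = 5_000
fair = np.ones(6) / 6
critical = chi2_dist.ppf(0.95, df=5)  # reject when χ² exceeds this

rejections = 0
for _ in range(n_experiments):
    counts = rng_ex.multinomial(120, fair)
    expected = counts.sum() * fair
    rejections += ((counts - expected) ** 2 / expected).sum() > critical

# Should land near 0.05, up to simulation noise and the χ² approximation.
print(f"Estimated false-positive rate: {rejections / n_experiments:.3f}")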

References#

  • Pearson, K. (1900). On the criterion… (original χ² idea)

  • Any introductory statistics text on chi-square tests